Authors' abstract



When implementing parallel programs, it is important to find strategies for controlling
parallelism that make the most effective use of available resources. In this paper,
we introduce a dynamic strategy called WorkCrews for controlling the use of parallelism
on small-scale, tightly-coupled multiprocessors. In the WorkCrew model, tasks
are assigned to a finite set of workers. As in other mechanisms for specifying parallelism,
each worker can enqueue subtasks for concurrent evaluation by other workers
as they become idle. The WorkCrew paradigm has two advantages. First, much of the
work associated with task division can be deferred until a new worker actually undertakes
the subtask, and avoided altogether if the original worker ends up executing
the subtask serially. Second, the ordering of queue requests under WorkCrews favors
coarse-grained subtasks, which further reduces the overhead of task decomposition.

1. Overview

In implementing parallel programs, identifying the opportunities for concurrency is only part of the
problem. In most cases, it is equally important to recognize that unrestricted parallelism can lead to
inefficiency. When this occurs, it is important to find strategies for controlling parallelism in order to
make the most effective use of available resources. In this paper, we introduce a dynamic strategy for
controlling the use of parallelism on small-scale, tightly-coupled multiprocessors. That strategy is based
on WorkCrews, a scheduling abstraction for parallel programs originally developed for a parallel C
compiler in Mark Vandevoorde's Master's thesis [Vandevoorde88].

In the WorkCrew model, the decision to subdivide a task does not rely solely on instantaneous
information about processor availability. Instead, potential task subdivisions are queued for execution
by any processor which becomes available while the requesting processor is busy. Using a queue extends
the window during which the subdivision can be made and increases the available parallelism.

The idea of using a pool of cooperating workers reading tasks from a queue is not new. This idea was
used in the C.mmp and Cm* projects at Carnegie-Mellon [Oleinick78, Gehringer87, Hibbard78] and
forms the basis of the problem heap paradigm used at Aarhus University in Denmark
[Møller-Nielsen85]. It is also related to the implementation strategy used for "strips" in the BBN
Pluribus [Ornstein75] and for "futures" in Multilisp [Halstead85]. Although based on this old idea, the
WorkCrew model offers two important advantages. First, the WorkCrew strategy often makes it possible
to avoid some of the overhead associated with task subdivision. This is accomplished by permitting
"lazy evaluation" of any work that is required only when concurrent execution actually occurs. When
tasks are processed serially, these costs can be avoided, resulting in significant efficiency increases for
some applications. Second, the ordering of task queue entries under the WorkCrew strategy follows the
recursive decomposition of the problem. This gives preference to coarse-grained decompositions that
usually offer the best opportunities for speedup.

In section 2, we establish the context for this work by illustrating the problems associated with
unbridled concurrency. Section 3 offers a simple solution to this problem and discusses some of the
opportunities for improvement that remain. Section 4 presents the basic WorkCrew mechanism, and section
5 outlines an adaptation of the basic model necessary to retain intuitive procedural semantics. Section 6
demonstrates the use of lazy evaluation to reduce the overhead cost in decomposition. Sections 7 and 8
discuss implementation and performance, respectively, and we offer some general conclusions in
Section 9.

2. The need to limit concurrency 

On a typical multiprocessor, there is no performance advantage in having more runnable threads than 
available processors. Instead, this situation represents a performance liability, since the additional 
threads imply increased scheduler overhead. To illustrate the importance of controlling parallelism, this 
section develops a simple parallel implementation of the standard Quicksort algorithm [Bentley84, 
Hoare62, Sedgewick78]. Figure 1 illustrates a Modula-2+ implementation of Quicksort that does not 
involve concurrency. Modula-2+ [Rovner85] is an extension of Wirth's standard Modula-2 [Wirth85] 
that includes new primitives in support of concurrency. These extensions are supplied via the Thread 
interface [Birrell87, Birrell89], which makes it possible to create new lightweight processes, that is, 
processes that share the same address space. 

For small numbers of items, Quicksort is less efficient than other sorting algorithms. Therefore, the
Quicksort procedure is optimized to call SelectionSort when there are fewer than MinQuick elements
to be sorted.

PROCEDURE Quicksort(array: RefIntArray; low, high: INTEGER);
VAR
    boundary: INTEGER;
BEGIN
    IF high - low < MinQuick THEN
        SelectionSort(array, low, high);
    ELSE
        boundary := Partition(array, low, high);
        Quicksort(array, low, boundary-1);
        Quicksort(array, boundary+1, high);
    END;
END Quicksort;

Figure 1
SerialQuicksort

In the general case, the Quicksort procedure calls Partition to divide the array into partitions
which satisfy the properties

    array^[i] ≤ pivot        low ≤ i < boundary
    array^[i] = pivot        i = boundary
    array^[i] > pivot        boundary < i ≤ high

where pivot is an element chosen by Partition, and boundary is the index of that element.
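
The three partition properties can be checked mechanically. The following is a minimal sketch in Python (used here only for illustration, not the paper's Modula-2+). The pivot choice (last element) and the single left-to-right scan are assumptions, since the text does not say how Partition is implemented; `check_partition` is a hypothetical helper that tests the stated properties.

```python
def partition(a, low, high):
    """Partition a[low..high] (inclusive) in place and return the boundary index."""
    pivot = a[high]                      # assumption: last element is the pivot
    boundary = low
    for i in range(low, high):
        if a[i] <= pivot:
            a[i], a[boundary] = a[boundary], a[i]
            boundary += 1
    a[boundary], a[high] = a[high], a[boundary]
    return boundary


def check_partition(a, low, high, boundary):
    """Return True if a[low..high] satisfies the three partition properties."""
    pivot = a[boundary]
    left_ok = all(a[i] <= pivot for i in range(low, boundary))
    right_ok = all(a[i] > pivot for i in range(boundary + 1, high + 1))
    return left_ok and right_ok


data = [5, 2, 9, 1, 5, 6]
b = partition(data, 0, len(data) - 1)
```

With this input the pivot is 6, so four elements end up to its left and one to its right.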

In this implementation, the two recursive calls to Quicksort are entirely independent and could easily 
be performed in parallel. The most obvious strategy is simply to fork a new thread at each recursive 
subdivision. Coding this in Modula-2+ gives rise to the "fork always" implementation shown in Figure 
2. 

Before discussing this example in detail, a few notes on the presentation are required. This example
introduces the primitives Thread.Fork and Thread.Join which provide the basic mechanism for
concurrency in Modula-2+. Thread.Fork(proc, arg) creates a new thread of control executing proc(arg)
and returns a handle of type Thread.T which can later be passed to Thread.Join to wait for completion
of that thread. Since the Modula-2+ implementation of Thread.Fork allows only a single argument
(and we will make use of the limitation in section 6), the individual arguments passed to Quicksort
must be assembled into an argument block which can then be passed as a unit. Thus, in
ForkAlwaysQuicksort and the examples which follow, we will make use of the type definition

TYPE
    ArgBlock = REF RECORD
        array: RefIntArray;
        low, high: INTEGER;
    END;

and assume the existence of a procedure CreateArgBlock which assembles a new block from its
arguments. This change in the argument structure also explains the need for two procedures in this
example. Quicksort itself is used only at the outer level and implements precisely the same abstraction as
the Quicksort procedure in the serial example. QSort is a private procedure, suitable for concurrent
invocation.

PROCEDURE Quicksort(array: RefIntArray; low, high: INTEGER);
BEGIN
    QSort(CreateArgBlock(array, low, high));
END Quicksort;

PROCEDURE QSort(args: ArgBlock);
VAR
    boundary: INTEGER;
    part1, part2: ArgBlock;
    child: Thread.T;
BEGIN
    WITH args^ DO
        IF high - low < MinQuick THEN
            SelectionSort(array, low, high);
        ELSE
            boundary := Partition(array, low, high);
            part1 := CreateArgBlock(array, low, boundary-1);
            part2 := CreateArgBlock(array, boundary+1, high);
            child := Thread.Fork(QSort, part2);
            QSort(part1);
            Thread.Join(child);
        END;
    END;
END QSort;

Figure 2
ForkAlwaysQuicksort



In ForkAlwaysQuicksort, Thread.Fork is used to create a new thread to execute one of the recursive
calls to QSort while the original thread performs the other. Thread.Join is used to ensure that both
operations are complete before the call on QSort returns. When this occurs, both threads have
completed their work, and the array is sorted.

ForkAlwaysQuicksort is easy to code, but not particularly practical. Even though Modula-2+ threads
are reused whenever possible during the execution of a program, ForkAlwaysQuicksort creates an
excessive number of threads, far more than the number of processors. On a 10,000-element array, the
resulting overhead is so severe that the program runs 5.9 times more slowly than SerialQuicksort on
the Firefly [Thacker87], which served as the base for our experimentation. Most of the additional time
is spent creating, forking, and joining threads.

3. Avoiding excess concurrency — the fork-when-idle strategy 

One simple and effective way to avoid unproductive parallelism is to use a fork-when-idle strategy: at
each division point, check to see if there are any idle processors; if so, perform the task in parallel,
otherwise do it serially. Figure 3 illustrates a procedure QSort that implements this strategy.

In ForkWhenIdleQuicksort, the global variable nIdle is used to maintain a count of the number of
idle processors. Since this variable may be referenced simultaneously by several independent threads,
the lock nIdleMutex is required to control access to nIdle. After calling Partition, QSort checks if
nIdle is greater than zero. If so, it performs the recursive calls to QSort in parallel as in
ForkAlwaysQuicksort. Otherwise, it performs the calls serially, avoiding the overhead of calling Fork
and Join.

PROCEDURE QSort(args: ArgBlock);
VAR
    boundary: INTEGER;
    part1, part2: ArgBlock;
    child: Thread.T;
    shouldfork: BOOLEAN;
BEGIN
    WITH args^ DO
        IF high - low < MinQuick THEN
            SelectionSort(array, low, high);
        ELSE
            LOCK nIdleMutex DO
                shouldfork := (nIdle > 0);
                IF shouldfork THEN nIdle := nIdle - 1; END;
            END;
            boundary := Partition(array, low, high);
            part1 := CreateArgBlock(array, low, boundary-1);
            part2 := CreateArgBlock(array, boundary+1, high);
            IF shouldfork THEN
                child := Thread.Fork(QSort, part2);
                QSort(part1);
                LOCK nIdleMutex DO nIdle := nIdle + 1; END;
                Thread.Join(child);
            ELSE
                QSort(part1);
                QSort(part2);
            END;
        END;
    END;
END QSort;

Figure 3
ForkWhenIdleQuicksort



The fork-when-idle strategy succeeds in avoiding almost all of the overhead of dividing a task for
parallel execution. At each division point, the question "is a processor free now?" is posed to choose
between the serial and parallel options. This test is too strict, however, since the decision is made on
the basis of an instantaneous snapshot of the processor utilization. A better criterion is "will a
processor become free while there is still a possibility of sharing responsibility for this task?". This is
the test used in WorkCrews.

4. The basic WorkCrew strategy 

In the WorkCrew paradigm, a set of worker threads cooperate to perform divisible tasks. When a
worker has a task that can be divided into two possibly concurrent subtasks, it begins work on one of
the subtasks and queues a help request for the other. When it finishes the first subtask, it checks to see
if its help request was answered by another worker. If so, the original worker assumes that the
operation is in good hands and returns. If not, the pending request is canceled, and the original worker
completes the remainder of the task itself.

WorkCrews are created by calling Create(n) where n is the number of worker threads (usually equal to
the number of processors). The Create procedure returns a handle for the WorkCrew which is then
passed to the other WorkCrew primitives. New top-level tasks are added by calling
AddTask(crew, proc, data) where crew is the WorkCrew handle returned by Create, proc is a
procedure which should be run on the client's behalf, and data is a bundled data value to be passed to the
procedure. AddTask eventually causes one of crew's workers to execute proc(data). Calling
Join(crew) suspends the caller until all of crew's tasks have been completed.

A procedure performing a task may subdivide its work by calling RequestHelp(proc, data).
RequestHelp is similar to Thread.Fork, except that the call to proc(data) is not performed unless and
until a worker becomes idle. After queuing this subtask, the worker that called RequestHelp must
eventually issue a corresponding call to GotHelp. If another worker has answered the help request in
the interim, GotHelp returns TRUE and the original worker can move on to other tasks. If not, the
call to GotHelp cancels the pending request, and the original worker must complete the remainder of
the task itself. Calls to RequestHelp and GotHelp may be nested, but it is the responsibility of the
client to ensure that each call to RequestHelp is properly paired with a corresponding GotHelp.

Figure 4 presents the WorkCrew version of Quicksort. After the array has been partitioned, 
WorkCrewQuicksort requests that the second recursive call to QSort be performed in parallel. It then 
calls QSort on the first half of the subdivided array, and, on returning, uses GotHelp to determine 
whether its earlier request for help was answered. If not, it calls QSort on the second half of the array. 

A worker's requests are answered in the order received. Thus, as long as the subdivision follows a
traditional divide-and-conquer strategy, WorkCrews will favor coarser grains of parallelism over finer
ones. This improves performance by reducing the time spent dividing tasks. By contrast, the
fork-when-idle strategy does not distinguish between coarse and fine parallelism in its steady state. As
noted above, the fork-when-idle strategy may also miss opportunities for parallel execution when
WorkCrews would not, since the former makes no allowance for the possibility that a processor might
soon become free.
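
To make the Create / AddTask / RequestHelp / GotHelp / Join protocol concrete, here is a rough sketch in Python (chosen only for illustration; the paper's implementation is in Modula-2+). It is not the authors' implementation. It assumes a single lock shared by the whole crew for simplicity, tracks completion with a hypothetical `active` counter, and ignores exceptions raised by tasks. The names `WorkCrew`, `_Worker`, `make_qsort`, and `MIN_QUICK` are all assumptions made for this example.

```python
import threading


class _Worker:
    def __init__(self):
        self.requests = []   # stack of pending (proc, data) help requests
        self.helped = 0      # requests[:helped] have been answered by helpers


class WorkCrew:
    def __init__(self, n_workers):
        self.cv = threading.Condition()   # one lock for everything (simplification)
        self.queue = []                   # unstarted top-level tasks
        self.active = 0                   # tasks currently being executed
        self.closed = False
        self.local = threading.local()
        self.workers = [_Worker() for _ in range(n_workers)]
        self.threads = [threading.Thread(target=self._root, args=(w,))
                        for w in self.workers]
        for t in self.threads:
            t.start()

    def add_task(self, proc, data):
        with self.cv:
            self.queue.append((proc, data))
            self.cv.notify_all()

    def request_help(self, proc, data):
        with self.cv:
            self.local.worker.requests.append((proc, data))
            self.cv.notify_all()

    def got_help(self):
        me = self.local.worker
        with self.cv:
            me.requests.pop()
            if me.helped > len(me.requests):   # the popped request was answered
                me.helped = len(me.requests)
                return True
            return False

    def join(self):
        with self.cv:
            self.cv.wait_for(lambda: not self.queue and self.active == 0)
            self.closed = True
            self.cv.notify_all()
        for t in self.threads:
            t.join()

    def _find_task(self):
        if self.queue:
            return self.queue.pop(0)
        for w in self.workers:                 # oldest unanswered request first
            if w.helped < len(w.requests):
                w.helped += 1
                return w.requests[w.helped - 1]
        return None

    def _root(self, me):
        self.local.worker = me
        while True:
            with self.cv:
                task = self._find_task()
                while task is None and not self.closed:
                    self.cv.wait()
                    task = self._find_task()
                if task is None:
                    return
                self.active += 1
            proc, data = task
            proc(data)
            with self.cv:
                self.active -= 1
                self.cv.notify_all()


MIN_QUICK = 16   # assumed cutoff for the small-array case


def partition(a, low, high):
    pivot, boundary = a[high], low
    for i in range(low, high):
        if a[i] <= pivot:
            a[i], a[boundary] = a[boundary], a[i]
            boundary += 1
    a[boundary], a[high] = a[high], a[boundary]
    return boundary


def make_qsort(crew):
    def qsort(args):
        a, low, high = args
        if high - low < MIN_QUICK:
            a[low:high + 1] = sorted(a[low:high + 1])  # stands in for SelectionSort
            return
        boundary = partition(a, low, high)
        crew.request_help(qsort, (a, boundary + 1, high))
        qsort((a, low, boundary - 1))
        if not crew.got_help():
            qsort((a, boundary + 1, high))
    return qsort
```

A caller would create a crew, add one top-level task such as `crew.add_task(make_qsort(crew), (data, 0, len(data) - 1))`, and then call `crew.join()`. Because helpers take the oldest unanswered request on a worker's stack, the coarsest pending subdivision is handed out first.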

5. WorkCrews vs. procedural semantics 

The client must exercise some degree of caution in using WorkCrews. In the absence of a call to
WorkCrew.Join, procedures that are written to use the WorkCrew paradigm do not necessarily adhere
to procedural semantics; when a recursive call on QSort returns in the WorkCrewQuicksort example,
the caller cannot assume that all of the internal work is complete. The procedure guarantees only that
the necessary operations to complete the call have been initiated, but this work may still be in progress
when the original worker returns.

In many practical applications, this fits well with the structure of the problem. In Quicksort, for
example, it is sufficient to fork worker tasks for the independent subtasks and then use WorkCrew.Join to
ensure that all tasks are complete. In other cases, however, it is necessary to provide a more
fine-grained mechanism for synchronizing the activity of the individual workers. In the WorkCrew
interface, this is accomplished by using the procedure pair EnterSubtaskGroup and JoinSubtaskGroup.
Like RequestHelp and GotHelp, these are paired operations, and clients must ensure proper bracketing.

The basic structure of these procedures is illustrated by the program fragment

    WorkCrew.EnterSubtaskGroup();
    P1();
    WorkCrew.JoinSubtaskGroup();
    P2();

In this example, suppose that procedures P1 and P2 must both be executed but that P2 can only be
started when P1 is finished. If P1 calls RequestHelp, it is not legitimate to assume that P1 is complete
when the worker who started P1 returns; other workers who came to help with task P1 may not yet
be complete.

PROCEDURE Quicksort(array: RefIntArray; low, high: INTEGER);
VAR
    crew: WorkCrew.T;
BEGIN
    crew := WorkCrew.Create(nWorkers);
    WorkCrew.AddTask(crew, QSort, CreateArgBlock(array, low, high));
    WorkCrew.Join(crew);
END Quicksort;

PROCEDURE QSort(args: ArgBlock);
VAR
    boundary: INTEGER;
    part1, part2: ArgBlock;
BEGIN
    WITH args^ DO
        IF high - low < MinQuick THEN
            SelectionSort(array, low, high);
        ELSE
            boundary := Partition(array, low, high);
            part1 := CreateArgBlock(array, low, boundary-1);
            part2 := CreateArgBlock(array, boundary+1, high);
            WorkCrew.RequestHelp(QSort, part2);
            QSort(part1);
            IF NOT WorkCrew.GotHelp() THEN
                QSort(part2);
            END;
        END;
    END;
END QSort;

Figure 4
WorkCrewQuicksort

When a worker calls JoinSubtaskGroup, it blocks until all RequestHelp operations issued since the 
corresponding EnterSubtaskGroup are complete. To maintain a constant number of active workers, 
the WorkCrew implementation automatically activates a new worker while the caller of 
JoinSubtaskGroup is blocked. 

To demonstrate where the use of these primitives is required, consider once again the Quicksort
example. The implementation from the last section misses an important opportunity for parallelism. In that
implementation, the Partition function makes a sequential pass over the entire array before any
opportunity for concurrency arises. If parallelism could also be used here, the total running time of the
algorithm could be reduced further.

One strategy is to implement a new routine PartitionSubset which can be used to partition only the
even or odd-numbered elements. If the same pivot value is chosen (55 in the example below), the
result after both calls on PartitionSubset will be an array divided into three parts, as shown in Figure
5. To the left of position A, all elements are less than or equal to the pivot; to the right of B, all
elements are larger.

Figure 5
Odd/Even Partitioning

Between A and B, the elements may be jumbled, but this can be repaired by an additional step which
partitions all elements in that range using the same pivot.

The advantage here is that the first two calls to PartitionSubset may be made in parallel, since the
elements they touch are disjoint. If this parallelism is achieved using WorkCrews, however, the final call
on Partition cannot be started until the first two are complete. This means that an
EnterSubtaskGroup/JoinSubtaskGroup pair is required, as shown in the ParallelPartitionQuicksort
example in Figure 6.

6. Deferring the cost of task decomposition 

The Quicksort example illustrates the structure of the WorkCrew mechanism, but does not demonstrate
one of the most important advantages of WorkCrews: the ability to defer the cost of splitting a task into
component parts until that division is actually performed. The structure of the parallel Quicksort
program is an example of "embarrassing parallelism." The only significant cost in the decomposition is the
cost of the fork and join.

In most practical applications, splitting a task into concurrent subtasks requires some additional
computation. Many applications that perform intermediate computation steps in parallel will need to serialize
the output of those computations. This is an example of overhead which is present in the parallel
decomposition but unnecessary in the traditional sequential coding.

When this sort of overhead exists, it is even more important to avoid splitting a task when there are not 
enough processors to carry out the computations concurrently. In such cases, dividing the task incurs 
not only the overhead of the fork and join, but also the cost of any operations required to manage the 
decomposition. The WorkCrew mechanism makes it rather easy to avoid these costs when there are not 
enough processors to warrant task division. 

For example, suppose that Subtask1 and Subtask2 are two subtasks that could conceivably be executed
in parallel. When the subtasks are executed serially, no additional work is required; if the task is
divided, some additional overhead is incurred. We assume that the overhead cost of splitting the task
can be encapsulated in a function PrepareTheData which updates the argument block to reflect the
required additional processing. Using the traditional fork/join mechanism, this would be represented as:

    child := Thread.Fork(Subtask2, PrepareTheData(args));
    Subtask1(args);
    Thread.Join(child);

TYPE
    SubsetType = (Odd, Even, All);

    PartitionBlock = REF RECORD
        array: RefIntArray;
        low, high: INTEGER;
        pivot: INTEGER;
        type: SubsetType;
        result: INTEGER;
    END;

PROCEDURE Partition(array: RefIntArray; low, high: INTEGER) : INTEGER;
VAR
    pivot, lowp, highp: INTEGER;
    odds, evens, combined: PartitionBlock;
BEGIN
    pivot := ChoosePivot(array, low, high);
    odds := CreatePartitionBlock(array, low, high, pivot, Odd);
    evens := CreatePartitionBlock(array, low, high, pivot, Even);
    WorkCrew.EnterSubtaskGroup();
    WorkCrew.RequestHelp(PartitionSubset, evens);
    PartitionSubset(odds);
    IF NOT WorkCrew.GotHelp() THEN
        PartitionSubset(evens);
    END;
    WorkCrew.JoinSubtaskGroup();
    lowp := MIN(odds^.result, evens^.result);
    highp := MAX(odds^.result, evens^.result);
    combined := CreatePartitionBlock(array, lowp, highp, pivot, All);
    PartitionSubset(combined);
    RETURN combined^.result;
END Partition;

Figure 6
ParallelPartitionQuicksort

In the WorkCrew model, the key observation is that PrepareTheData need only be called if the task is
actually divided and not when the request for help is posted. To make this possible,
WorkCrew.RequestHelp accepts an optional third argument which is a "data preparer." The data
preparer is stored with the task request and called whenever a worker in the crew actually takes on the
subtask. Thus, in the WorkCrew case, the example above would be coded as

    WorkCrew.RequestHelp(Subtask2, args, PrepareTheData);
    Subtask1(args);
    IF NOT WorkCrew.GotHelp() THEN
        Subtask2(args);
    END;

Note that PrepareTheData is called only if another worker actually takes over the subtask operation.
If the original worker manages to complete Subtask1 before any helper arrives to handle the request for
help with Subtask2, this degenerates into the purely sequential case and the additional processing is not
required. In effect, the data preparer is applied "lazily" so that it is called only when needed.
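
The lazy application of the data preparer can be illustrated without any threads. The sketch below, in Python for illustration only, models a single help request; `HelpRequest`, `answer`, and the `calls` log are hypothetical names invented for this example, and the "helper" is simulated by calling `answer` directly.

```python
class HelpRequest:
    """One pending request for proc(data), with an optional lazy data preparer."""

    def __init__(self, proc, data, preparer=None):
        self.proc = proc
        self.data = data
        self.preparer = preparer
        self.answered = False

    def answer(self):
        """Helper side: take over the subtask, preparing the data only now."""
        self.answered = True
        data = self.preparer(self.data) if self.preparer else self.data
        return self.proc(data)

    def got_help(self):
        """Requester side: True if some helper answered the request."""
        return self.answered


calls = []   # records which procedures actually ran


def prepare_the_data(args):
    calls.append("prepare")
    return args


def subtask2(args):
    calls.append("subtask2")


# Serial case: nobody answers, so the requester runs subtask2 itself
# and the preparer is never invoked.
req = HelpRequest(subtask2, {}, prepare_the_data)
if not req.got_help():
    subtask2(req.data)
serial_calls = list(calls)

# Divided case (simulated): a helper answers, so the preparer runs first.
calls.clear()
req = HelpRequest(subtask2, {}, prepare_the_data)
req.answer()
parallel_calls = list(calls)
```

In the serial case only `subtask2` is recorded; in the divided case the preparer runs immediately before it.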

To make this aspect of WorkCrews more concrete, consider the problem of implementing a file
searching program (like Unix grep) that can process several independent files in parallel. As a primitive,
assume that you have a procedure Grep which takes three arguments: the pattern string, the name of a
single file on which to operate, and a stream on which to write the output (such streams are called
"writers" in the Modula-2+ environment and have type Wr.T). Thus, to find and print on standard
output all lines containing "xyzzy" in the file adventure.txt, we could call

Grep("xyzzy", "adventure.txt", stdout); 

Unfortunately, concurrent execution of this operation on two different files at the same time would be 
inappropriate, since the output would be hopelessly interleaved. Instead, we need some mechanism to 
serialize the output stream. 

Fortunately, such a mechanism (called "splitwriters") exists in the Modula-2+ library. Given a
splitwriter w1, the statement

    w2 := SplitWriter.Split(w1);

yields a second splitwriter w2 so that everything written to w1 will precede in the eventual output
stream everything written to w2. Internally, any writes to w2 are buffered until w1 is closed and then
dumped on the output stream. Either descendant of the SplitWriter.Split operation can be split
arbitrarily often.
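
The ordering guarantee of splitwriters can be modeled in a few lines. The following Python sketch is a hypothetical model, not the Modula-2+ SplitWriter interface: it buffers everything and concatenates the buffers at the end, rather than streaming output when the earlier writer is closed. The class name and its `write`, `split`, and `output` methods are assumptions made for this example.

```python
class SplitWriter:
    """Toy model: writers share an ordered list that fixes the output order."""

    def __init__(self, order=None):
        self.chunks = []                                  # text written to this writer
        self.order = order if order is not None else [self]

    def write(self, text):
        self.chunks.append(text)

    def split(self):
        """Return a writer whose output follows everything written to self."""
        later = SplitWriter(self.order)
        self.order.insert(self.order.index(self) + 1, later)
        return later

    def output(self):
        """Concatenate all buffers in their fixed order."""
        return "".join(c for w in self.order for c in w.chunks)


w1 = SplitWriter()
w2 = w1.split()
w2.write("b\n")        # written first, but must appear after w1's output
w1.write("a\n")
w3 = w1.split()        # splitting w1 again subdivides w1's region
w3.write("c\n")
w1.write("a2\n")
result = w1.output()
```

Regardless of the order in which the writes happen, the output lists w1's text, then w3's, then w2's.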

This makes it possible to code a procedure MultiGrep which calls the Grep procedure on a list of files. 
Splitting a writer has an associated overhead cost, however, and we would like to avoid this whenever 
possible. The WorkCrew structure makes this quite convenient since the SplitWriter operation can be 
included in the data preparer function and thus be invoked only when the task is actually split. 

For concreteness, assume that MultiGrep takes an argument block of the following form:

TYPE
    TaskBlock = REF RECORD
        files: TextRefArray;
        low, high: INTEGER;
        pattern: Text.T;
        outfile: Wr.T;
    END;

The MultiGrep program itself is shown in Figure 7.

7. Implementation

Our implementation of WorkCrews is based on the Thread module of Modula-2+. In section 2, we 
explained the Thread abstraction and made use of the operations Fork and Join. The Thread module 
also provides abstractions for locks and conditions. Locks have the operations Acquire and Release, 
which are automatically generated by the compiler when the LOCK statement is used. The principal 
operations of conditions are Signal and Wait. A thread calls Wait to block until a condition occurs. 
Signal is called to indicate that a condition has occurred. In designing our implementation of 
WorkCrews, we also exploited several properties of the Thread module. In the absence of contention, 
acquiring and releasing a lock costs a total of only five instructions. Signaling a condition on which no 
thread is waiting costs only two instructions. In other cases, operations on locks and conditions require 
system calls. 

The description of our WorkCrew implementation is divided into two sections. First, we describe a 
basic implementation without support for subtask groups. Later, we describe the extensions required for 
managing subtask groups. 

PROCEDURE MultiGrep(args: TaskBlock);
VAR
    part1, part2: TaskBlock;
    midpoint: INTEGER;
BEGIN
    WITH args^ DO
        IF low > high THEN RETURN END;
        IF low = high THEN
            Grep(pattern, files^[low], outfile);
        ELSE
            midpoint := (low + high) DIV 2;
            part1 := args;
            part2 := CopyTaskBlock(part1);
            part1^.high := midpoint;
            part2^.low := midpoint + 1;
            WorkCrew.RequestHelp(MultiGrep, part2, SplitTheWriter);
            MultiGrep(part1);
            IF NOT WorkCrew.GotHelp() THEN
                MultiGrep(part2);
            END;
        END;
    END;
END MultiGrep;

PROCEDURE SplitTheWriter(args: TaskBlock) : TaskBlock;
BEGIN
    args^.outfile := SplitWriter.Split(args^.outfile);
    RETURN args;
END SplitTheWriter;

Figure 7
MultiGrep

7.1 Basic implementation

WorkCrews are implemented using two principal data types: Crew and Worker. A Crew is a tuple
including:

    workers        the set of all Workers in this Crew
    taskQueue      a queue of unstarted tasks created by AddTask
    noMoreTasks    a flag indicating whether Join has been called
    nForked        the number of workers created
    nRetired       the number of workers that have exited
    nIdle          the number of workers waiting for tasks
    allDone        the condition to denote completion of all tasks
    wakeWorker     the condition to denote that a new task exists
                   or that all tasks have been completed
    crewMutex      a lock to synchronize access to the above

A Worker is a tuple including:

    crew           a backpointer to the shared Crew object
    requests       a stack of help requests
    sp             the top of stack pointer for requests
    helpedPtr      a pointer into requests
    workerMutex    a lock to synchronize access to the above



An important invariant in the implementation is that helpedPtr is between sp and the base of requests, 
inclusive. All help requests between the base and helpedPtr have been answered, and all others have 
not. 

Most of the WorkCrew operations have straightforward implementations. Create(n) initializes a new
Crew and n new Workers, forking a new thread for each worker. AddTask simply enqueues a task in
taskQueue while holding the crewMutex lock and wakes up an idle worker, if any exist. Join sets the
flag noMoreTasks to be TRUE and waits for the condition allDone.

Each worker thread executes the internal procedure WorkerRoot shown in Figure 8. WorkerRoot is 
the key to understanding how a worker finds a task and, when none exist, how a worker determines 
whether to exit or wait for new tasks to appear. 



PROCEDURE WorkerRoot(me: Worker);
VAR
    task: Task;
    crew: Crew;
BEGIN
    crew := me^.crew;
    LOOP
        IF FindTask(me, task) THEN
            DoTask(me, task);
            IF CompletedSubtaskGroup(task) THEN
                WakeJoinSG(task, crew);
                EXIT;
            END;
        ELSE
            LOCK crew^.crewMutex DO
                IF AllTasksDone(crew) THEN
                    Terminate(crew);
                    EXIT;
                ELSE Block(crew)
                END;
            END;
        END;
    END;
END WorkerRoot;

Figure 8
WorkerRoot

FindTask first checks the taskQueue for unstarted tasks. If the queue is empty, then it searches for a
Worker with unanswered help requests. Testing for unanswered requests is fast since it requires only a
pointer comparison of sp and helpedPtr. FindTask extracts the first unanswered request it finds,
adjusts helpedPtr to mark the request as answered, and returns TRUE. If no unanswered request is
found, it returns FALSE.

DoTask performs the task just found by FindTask. Note that this may involve calling a data preparer
if the task is from a help request. The call to CompletedSubtaskGroup is related to subtask
management, which will be described shortly. In the absence of subtask groups, the call always returns
FALSE and the worker simply goes to the beginning of the loop.

When FindTask returns FALSE, WorkerRoot checks for termination by calling AllTasksDone. The
criteria for termination are that Join has been called (noMoreTasks is TRUE), and that all workers
other than the one executing AllTasksDone are blocked (nRetired + nIdle = nForked - 1). By calling
Terminate, the first worker to detect termination wakes all other blocked workers, which then exit in
succession. Terminate also signals allDone to wake the thread that called Join.

If AllTasksDone returns FALSE, there is a possibility that new tasks will be created. Therefore,
WorkerRoot calls Block to mark the worker as idle and suspend execution until either a new task is
created or termination is detected (condition wakeWorker).
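
The termination test can be written as a one-line predicate over the Crew counters. The sketch below uses Python for illustration; `CrewState` is a hypothetical record holding only the fields the test needs, and the caller is assumed to hold the crew lock.

```python
from dataclasses import dataclass


@dataclass
class CrewState:
    no_more_tasks: bool   # True once Join has been called
    n_forked: int         # workers created
    n_retired: int        # workers that have exited
    n_idle: int           # workers blocked waiting for tasks


def all_tasks_done(crew):
    """True if Join was called and every other worker is retired or idle."""
    return crew.no_more_tasks and crew.n_retired + crew.n_idle == crew.n_forked - 1


# Four workers: the caller plus three idle ones, after Join has been called.
finished = all_tasks_done(CrewState(True, 4, 0, 3))
# Same crew, but one other worker is still busy with a task.
still_busy = all_tasks_done(CrewState(True, 4, 0, 2))
```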

We now turn to the implementations of RequestHelp and GotHelp. These operations must be fast
since they are called frequently. RequestHelp simply pushes its arguments onto the worker's requests
stack, while holding workerMutex, and then calls Signal to wake an idle worker, if any exist.
Typically, the calls to Acquire, Release, and Signal will be the efficient ones described at the start of this
section: there is little contention for workerMutex since each worker has its own, and a worker only
briefly acquires another worker's workerMutex when searching for a task to perform. Also, if we
assume that there is an excess of parallelism, workers will seldom block waiting for tasks, so the call to
Signal in RequestHelp is efficient.

The GotHelp operation is similarly fast. It acquires workerMutex and pops the requests stack. If 
helpedPtr > sp, then the request was answered, and GotHelp sets helpedPtr to sp to maintain the 
invariant on the stack of requests. Finally, it releases workerMutex. As in RequestHelp, the lock 
operations are typically fast since contention for each workerMutex is low. 
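
The interplay of sp and helpedPtr is easiest to see on a single worker's stack. The Python sketch below (illustration only) leaves out workerMutex and the Signal call so that the invariant stands out; `RequestStack` and `helped_ptr` are hypothetical names, and the length of the list plays the role of sp.

```python
class RequestStack:
    """One worker's help requests; requests[:helped_ptr] have been answered."""

    def __init__(self):
        self.requests = []    # index 0 is the base; len(requests) acts as sp
        self.helped_ptr = 0

    def request_help(self, task):
        self.requests.append(task)

    def find_task(self):
        """Helper side: take the oldest unanswered request, or None."""
        if self.helped_ptr == len(self.requests):
            return None
        task = self.requests[self.helped_ptr]
        self.helped_ptr += 1
        return task

    def got_help(self):
        """Owner side: pop the newest request; True if it had been answered."""
        self.requests.pop()
        if self.helped_ptr > len(self.requests):
            self.helped_ptr = len(self.requests)   # restore the invariant
            return True
        return False


stack = RequestStack()
stack.request_help("coarse")      # outer subdivision, pushed first
stack.request_help("fine")        # nested subdivision
taken = stack.find_task()         # a helper arrives and takes the oldest request
inner = stack.got_help()          # owner pops "fine": nobody answered it
outer = stack.got_help()          # owner pops "coarse": it was answered
```

The helper receives the coarse request, so the owner must finish the fine one itself but can return as soon as it reaches the outer GotHelp.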

7.2 Managing subtask groups 

Given the implementation of WorkCrews described thus far, only a few extensions are required to
support subtask groups. A subtask group is simply a set of tasks created by RequestHelp bracketed by
calls to EnterSubtaskGroup and JoinSubtaskGroup. When a call to RequestHelp is bracketed by
more than one enter/join pair, the task it creates belongs to the innermost enter/join pair.

The purpose of JoinSubtaskGroup is to suspend its caller until all tasks created since the
corresponding call to EnterSubtaskGroup are completed. Because the subtask group operations must follow
stack discipline, we observe that tasks in nested subtask groups will be completed when
JoinSubtaskGroup is called.

Our strategy for managing subtask groups is to maintain a count of the number of workers performing 
tasks in each subtask group. A worker increments the count when it begins a task in the group, and 
decrements the count when it finishes the task. JoinSubtaskGroup blocks until the count is zero, and 
the last worker to finish a task in the group is responsible for waking the join, if necessary. 

We introduce a new type, SubtaskGroup, which is a tuple: 

    count          the number of workers cooperating on this subtask group 
    groupMutex     a lock to synchronize access to count 
    subtaskDone    a condition used to detect completion of a subtask group 
    prev           a pointer to the smallest enclosing SubtaskGroup (NIL if none) 

In addition, we augment the Worker type to include the component: 

    groupStack     the most deeply nested SubtaskGroup for which the worker is executing 
                   (NIL if there is none) 

Semantically, groupStack behaves like a stack to the particular worker (entries are chained using the 
prev component). When viewed together, the worker groupStacks form an inverted tree. 

The operation EnterSubtaskGroup pushes a new SubtaskGroup onto groupStack. The new 
SubtaskGroup has a count of one to indicate that a single worker (the one executing 
EnterSubtaskGroup) is computing the subtask. 

JoinSubtaskGroup decrements the count in the top of the groupStack. If the result is zero, then all 
workers that ever cooperated in performing the subtask have finished, so JoinSubtaskGroup pops the 
groupStack and returns. Otherwise, it blocks until the count is zero by waiting for the condition 
subtaskDone. Since the number of active workers should remain constant, a new worker thread is 
created before blocking. Once awakened, JoinSubtaskGroup pops groupStack. 

A minor extension to RequestHelp is necessary to maintain the association of each task with its correct 
SubtaskGroup. Each entry in the requests stack is augmented to include a pointer to its associated 
SubtaskGroup. RequestHelp initializes this pointer to be the worker's groupStack at the time of the 
call. Note that this is consistent with our earlier definition that a help request's subtask group is the one 
created by the last EnterSubtaskGroup call prior to that request. 

The final extensions involve the internal procedures for selecting and performing tasks. When 
FindTask chooses to answer a help request, it initializes task's SubtaskGroup to be the one noted in 
the help request. DoTask sets the worker's groupStack to refer to the same SubtaskGroup and incre- 
ments the count therein to reflect the activity of the worker answering the request. 

The corresponding decrement of the count occurs when WorkerRoot calls CompletedSubtaskGroup 
(see Figure 8). If the SubtaskGroup is NIL, then CompletedSubtaskGroup immediately returns; oth- 
erwise, it decrements the count. If the result is zero, all tasks in the subtask group are completed, so it 
returns TRUE; otherwise it returns FALSE. 

If the subtask group was completed, the WorkerRoot calls WakeJoinSG to wake the worker blocked 
in JoinSubtaskGroup and exits. The effect is to keep the number of active workers constant, since 
JoinSubtaskGroup had created a new worker before blocking. 
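
The counting scheme described above can be sketched as follows. This is a simplified illustration in Python, not the paper's implementation: it folds groupMutex and subtaskDone into a single threading.Condition, omits groupStack and the creation of a replacement worker before blocking, and uses ordinary threads in place of crew workers. The names begin_task, completed, and run_group are hypothetical.

```python
import threading


class SubtaskGroup:
    def __init__(self, prev=None):
        self.count = 1                     # the worker that entered the group
        self.cond = threading.Condition()  # stands in for groupMutex + subtaskDone
        self.prev = prev                   # enclosing group, None if there is none

    def begin_task(self):
        # A worker answering a help request joins the group.
        with self.cond:
            self.count += 1

    def completed(self):
        # A worker finished its task; the last one out wakes the join.
        with self.cond:
            self.count -= 1
            if self.count == 0:
                self.cond.notify_all()
                return True
            return False

    def join(self):
        # The entering worker gives up its own count, then waits for zero.
        with self.cond:
            self.count -= 1
            while self.count > 0:
                self.cond.wait()


def run_group(n_helpers):
    group = SubtaskGroup()
    results = []
    results_lock = threading.Lock()

    def helper(i):
        with results_lock:
            results.append(i * i)
        group.completed()

    for i in range(n_helpers):
        group.begin_task()  # counted before the helper starts running
        threading.Thread(target=helper, args=(i,)).start()
    group.join()
    return sorted(results), group.count
```

Because the entering worker holds one unit of the count until it calls join, the count cannot reach zero while that worker may still be creating tasks in the group.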

As a final note, the implementation avoids deadlock because no worker ever holds more than one lock 
of each kind at a time, and multiple locks are always acquired in the order crewMutex, workerMutex, 
groupMutex. 

8. Performance 

To assess the success of the WorkCrew mechanism, we measured the performance of the examples 
presented above on the Firefly [Thacker87]. The Firefly is a shared-memory multiprocessor worksta- 
tion, designed and built at the Digital Equipment Corporation's Systems Research Center in Palo Alto. 

The timings for the five versions of Quicksort are presented in Table I below. In each case, the experi- 
ment was run repeatedly on an otherwise idle Firefly configured with five MicroVAX-II processors. 
The average for ten trials is reported, along with the standard deviation. The speedup is calculated 
using SerialQuicksort as a baseline. The ForkAlways strategy occurs in the table only for 10K ele- 
ments, since anything larger turned out to be infeasible with this strategy. 
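
As a check on how the speedup column is derived, the calculation can be written out as below. This is a small Python sketch; the function name speedup is ours, and the timings are taken from Table I.

```python
def speedup(serial_seconds, parallel_seconds):
    """Speedup relative to the serial baseline, rounded as in Table I."""
    return round(serial_seconds / parallel_seconds, 2)


# Mean times in seconds for the 100K-element runs reported in Table I.
serial_100k = 46.4
workcrew_100k = 14.4
fork_when_idle_100k = 16.3

print(speedup(serial_100k, workcrew_100k))        # 46.4 / 14.4
print(speedup(serial_100k, fork_when_idle_100k))  # 46.4 / 16.3
```

A value below 1, as for ForkAlways at 10K elements (3.7 / 21.8), means the parallel version ran slower than SerialQuicksort.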

As Table I shows, the WorkCrew implementation of Quicksort has a considerable performance advantage 
over ForkWhenIdle at each of the problem sizes tested. The fact that the speedup is not closer to 
the number of processors is due in part to the fact that the algorithm includes a long serial partition 
step during which no parallelism is available. In addition, the basic WorkCrew mechanism and the 
scheduler both introduce some overhead which reduces the speedup below the theoretical limit. 

                                   mean time    standard     speedup 
    version            elements    (sec)        deviation    (serial = 1) 

    Serial             10K           3.7        0.036        1.00 
    ForkAlways         10K          21.8        0.536        0.17 
    ForkWhenIdle       10K           1.6        0.084        2.31 
    WorkCrew           10K           1.3        0.017        2.85 
    ParallelPartition  10K           1.2        0.056        3.08 

    Serial             100K         46.4        0.129        1.00 
    ForkWhenIdle       100K         16.3        0.273        2.85 
    WorkCrew           100K         14.4        0.143        3.22 
    ParallelPartition  100K         12.9        0.311        3.60 

    Serial             1M          560.3        2.444        1.00 
    ForkWhenIdle       1M          183.1        4.427        3.06 
    WorkCrew           1M          157.7        0.773        3.55 
    ParallelPartition  1M          147.6        3.312        3.80 

                              Table I 
                       Quicksort Performance 

To evaluate the performance advantage that we can achieve through lazy evaluation of the task decom- 
position overhead, we built two WorkCrew-based implementations of MultiGrep. One implementation 
was the one given in section 6, in which the call to SplitWriter occurs in a separate data preparer rou- 
tine so that this cost is paid only when concurrent evaluation actually occurs. The other was an identi- 
cal implementation in which the SplitWriter call appears directly within the body of the MultiGrep 
procedure. We used these implementations to search for the common string "INTEGER" in a library 
directory containing 229 files. 
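
The difference between the two implementations can be sketched as follows. This is an illustrative Python model, not the MultiGrep code from section 6: HelpRequest, perform, and count_prepare_calls are hypothetical names, the prepare function stands in for the data preparer routine that calls SplitWriter, and the answered flags simply simulate whether another worker took each request.

```python
class HelpRequest:
    """A queued subtask together with its deferred data preparer."""

    def __init__(self, task, prepare, data):
        self.task = task
        self.prepare = prepare
        self.data = data


def perform(request, answered_by_helper):
    if answered_by_helper:
        # Concurrent evaluation: pay the preparation cost now.
        data = request.prepare(request.data)
    else:
        # The requester runs the subtask serially; no preparation is needed.
        data = request.data
    return request.task(data)


def count_prepare_calls(answered_flags):
    calls = 0

    def prepare(data):
        nonlocal calls
        calls += 1  # stands in for the cost of a SplitWriter call
        return data

    for i, answered in enumerate(answered_flags):
        perform(HelpRequest(len, prepare, [i]), answered)
    return calls
```

In this model an eager implementation would pay the preparation cost once per request, while the lazy one pays it only for requests that another worker actually answers.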

In our first experiment, the implementation that used lazy evaluation to avoid unnecessary 
SplitWriter calls provided only a 2% performance advantage. This seemed disappointingly low until we 
recognized that MultiGrep was disk-limited to such an extent that the entire process was essentially 
serialized, reducing any advantage due to parallelism. When we repeated the experiment starting with 
the files in memory, the advantage rose to 13%, illustrating that splitting the writer represented a 
significant fraction of the total cost. 

Note that the strategy used in the subdivision has considerable influence on the savings that can be 
achieved. As written, the MultiGrep example from section 6 divides the list of files in half and issues 
a help request for the second portion. By keeping both subtasks relatively large, this strategy increases 
the chance that each subtask can process several files sequentially without incurring a split. If, on the 
other hand, MultiGrep had been coded to split the task at the next file and request help for the entire 
remainder of the list, workers would tend to leapfrog forward through that list, incurring the overhead 
of a SplitWriter call for almost every file. When we ran this experiment, the divide-in-half approach ended 
up requiring only 12 split operations when scanning the 229 files, while the split-at-next-file approach 
required 222 splits. 

In the MultiGrep example, of course, we expected that the mechanics of serializing the output would 
represent a significant fraction of the overhead, since relatively little work is done in the subtasks them- 
selves: definitions files are individually short and the search algorithm is reasonably efficient. In other 
applications, the balance between the actual work and the overhead due to task decomposition will be 
different, and it is hard to predict to what extent this technique will reduce the total time. Based on our 
experience, however, we believe that the use of lazy evaluation to reduce decompositional overhead 
will provide significant efficiency improvements in a variety of practical situations. 

9. Conclusions 

In developing the parallel C compiler [Vandevoorde88] and other applications here at SRC, we have 
found the WorkCrew concept to be a useful mechanism for the efficient subdivision of tasks. Unlike 
traditional mechanisms for expressing parallelism, the WorkCrew strategy makes a dynamic decision 
about the availability of processing resources so that fewer opportunities for parallel decomposition are 
lost. Moreover, the WorkCrew mechanism makes it possible to reduce the overhead associated with 
task decomposition by deferring this cost until the decomposition actually occurs. 